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Abstract —Discovering causai reiationships from observationai data is a cruciai probiem and it has appiications in many research 
areas. The PC aigorithm is the state-of-the-art constraint based method for causai discovery. However, runtime of the PC aigorithm, in 
the worst-case, is exponentiai to the number of nodes (variabies), and thus it is inefficient when being appiied to high dimensionai data, 
e.g. gene expression datasets. On another note, the advancement of computer hardware in the iast decade has resuited in the 
widespread avaiiabiiity of muiti-core personai computers. There is a significant motivation for designing a paraiieiised PC aigorithm that 
is suitabie for personai computers and does not require end users’ paraiiei computing knowiedge beyond their competency in using the 
PC aigorithm. in this paper, we deveiop paraiiei-PC, a fast and memory efficient PC aigorithm using the paraiiei computing technique. 
We appiy our method to a range of synthetic and reai-worid high dimensionai datasets. Experimentai resuits on a dataset from the 
DREAM 5 chaiienge show that the originai PC aigorithm couid not produce any resuits after running more than 24 hours; meanwhiie, 
our paraiiei-PC aigorithm managed to finish within around 12 hours with a 4-core CPU computer, and iess than 6 hours with a 8-core 
CPU computer. Furthermore, we integrate paraiiei-PC into a causai inference method for inferring miRNA-mRNA reguiatory 
reiationships. The experimentai resuits show that paraiiei-PC heips improve both the efficiency and accuracy of the causai inference 
algorithm. 
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1 Introduction 

I Nvestigating the associations between variables has long 
been a main research topic in statistics, data mining 
and other research areas. For instance, to construct a gene 
regulatory network, we might want to find the associations 
between the expression levels of genes. We may use the 
observed association to explain that one gene regulates the 
expression of the other. However, the association does not 
tell us which gene is the regulator and which gene is the tar¬ 
get, or indicate the chance of a third gene regulating the two 
genes. Hence, associations between genes may not imply 
the cause-effect nature of gene regulatory relationships. 

There has been a long controversy over whether causal¬ 
ity can be discovered from observational data. Many sci¬ 
entists have attempted to design algorithms of inferring 
causality from observational data over the last few decades 
0, 0, 0, li|. As a result, there has been more and 
more evidence demonstrating the possibility of discovering 
causal relationships from non-experimental data in different 
research areas 0, 0, Especially, two Nobel prizes were 
awarded for such causality methods in the field of Eco¬ 
nomics in 2003 and 2011 respectively 0, 0. These works 
have strongly motivated the utilisation of observational data 
in exploring causal relationships in other important research 
areas, such as finding the genetic causes of cancers. 
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The standard method for discovering causality is a ran¬ 
domised controlled experiment. Eor example, to assess the 
effect of gene A on other genes, biologists use the gene 
knockdown experiment. In the experiment, they create two 
groups of samples, control and transfected groups. Gene 
A is knocked down in the control group whilst remains 
in the transfected group. The changes in the expression 
levels of the target genes between the two groups are the 
effects that gene A has on them. However, conducting 
such experiments would incur humongous costs given that 
thousands of genes are required to be tested. Moreover a 
randomised controlled experiment is generally impossible 
to conduct or restricted by ethical concerns in many cases. 
Therefore, discovering causality from observational data is 
a crucial problem. 

One of the most advanced theories with widespread 
recognition in discovering causality is the Causal Bayesian 
Network (CBN). In this framework, causal relationships 
are represented with a Directed Acyclic Graph (DAG) 0. 
Learning a DAG from data is highly challenging and com¬ 
plex as the number of possible DAGs is super-exponential 
to the number of nodes 0. There are two main approaches 
for learning a DAG: the search and score approach, and 
the constraint based approach. While the search and score 
approach raises an NP-hard problem, the complexity of the 
constraint based approach is exponential to the number of 
nodes. The high computational complexity has hindered 
the applications of causal discovery approaches to high 
dimensional datasets, especially gene expression datasets 
of which the number of genes (variables) is large and the 
number of samples is normally small. 
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A well-known constraint based algorithm is the Induc¬ 
tive Causation (IC) algorithm proposed in |j^. It is a concep¬ 
tual algorithm and was implemented in the PC (named after 
its authors, Peter and Clark) algorithm . The PC algorithm 
starts with a complete, undirected graph and deletes recur¬ 
sively edges based on conditional independence decisions. 
For example, the edge between A and 13 is removed if 
we can find a set S that does not include A and B, and 
when conditioning on S, A and B are independent. The 
PC algorithm has been implemented in various open-source 
sofware such as TETRAD (2), pcalg 0, and bnlearn jTo) , 
and has become a reliable tool for causal explorations. 

There have been several applications of the PC algo¬ 
rithm in Bioinformatics pT| , (12) , (T^ , (Ti) , (Ts) . Specifically, 
Zhang et al. | [Tl] | applied the PC algorithm which used 
mutual information for the conditional independence tests 
to learn the gene regulatory networks from gene expression 
data. Maathuis et al. ||^| used the PC algorithm to learn 
the causal structure of the gene regulatory networks, then 
they applied do-calculus Q to the learnt network to infer 
the causal effect that a gene has on the other. Le et al. 
p3) applied the approach in (^ to predict the targets of 
microRNAs (miRNAs, an important class of gene regulators 
at the post-transcriptional level). In a similar fashion, Zhang 
et al. (Til l inferred the miRNA-mRNA interactions that had 
different causal effects between biological conditions, e.g. 
normal and cancer. These applications of the PC algorithm 
have been shown to outperform other computational meth¬ 
ods that are based on correlation or regression analysis. 
Moreover, the PC algorithm is able to distinguish direct 
interactions from indirect interactions in gene regulatory 
networks 1[1^, (Ts) . 

TTowever, there are two main limitations of the PC 
algorithm, especially when applying to high dimensional 
biological datasets: (i) the runtime of the PC algorithm, 
in the worst case, is exponential to the number of nodes 
(variables), and thus it is inefficient when applying to high 
dimensional datasets such as gene expression datasets, and 
(ii) the result from the PC algorithm is variable ordered- 
dependent, i.e. when we change the order of the variables 
in the input dataset, the result may change. 

Several methods (Tt) , |^, (T^ have been proposed 
to improve the efficiency of the PC algorithm. They aim 
to introduce efficiency-improved alternative methods, but 
they rather compromise the accuracy. These methods either 
search for only specific causal structures or use heuristic 
functions to improve the efficiency of the algorithm. ITow- 
ever, the results from these heuristic methods are either 
incomplete or involving a high rate of false discoveries. 

Additionally, the order-dependence of the PC algorithm 
is problematic in high dimensional datasets. Colombo et 
al. ( 20 ) has shown experimentally that around 40% of the 
edges (2000 edges) learnt from a real gene expression dataset 
are not stable, i.e. these edges only appear in less than 
half of the results obtained with all the different orderings 
of nodes. This problem makes the knowledge inferred by 
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using the PC algorithm less reliable. To overcome the order- 
dependence of the PC algorithm, researchers proposed sev¬ 
eral modifications to the procedure of the algorithm 0, 
pO] , |2T) , ( 22 ). For instance, Colombo et al. ||^ proposed a 
modified version of the PC algorithm called stable-PC. The 
algorithm aims to query all the neighbours of each node and 
fix these neighbours while conducting conditional indepen¬ 
dence tests (Cl tests henceforward) at each level (based on 
size of the conditioning sets) of the PC algorithm. TTowever, 
this modification requires more CT tests and therefore results 
in even longer running time for this already inefficient 
algorithm. Our experiment on a dataset from the DREAM 
5 challenge (http://dreamchallenges. 0 rg/ l with i643 vari¬ 
ables and 805 samples shows that the order-independence 
version of the PC algorithm |20) takes more than 24 hours to 
run (unfinished). The inefficiency of the algorithm hinders 
its application in practice, as we would not estimate the 
runtime of a real world dataset up-front, and thus, pro¬ 
longed running time would prevent us from applying the 
algorithm. 

In another direction, parallel computing provides a 
straightforward approach to improving the efficiency of the 
algorithm, but it has been impractical for normal end users. 
The last decade has seen a fast growth of parallel computing 
frameworks, such as MapReduce framework p3] . TTowever, 
such frameworks need suitable cluster facilities and require 
some solid technical understanding from users. Meanwhile, 
end users, e.g. biologists, normally use personal computers 
or small lab servers for their everyday research, and may 
not easily acquire the required technical knowledge for 
applying parallelised data mining algorithms. On the other 
hand, the fast development of computer hardware in the 
last decade has resulted in the widespread availability of 
multi-core personal computers. It is common for a modem 
computer to have a four-core CPU, and small servers to 
have eight-core CPUs. Therefore, it is of great interest to 
design parallel algorithms that are suitable for personal 
computers. Specifically, we aim to design the parallelised 
PC algorithm (called parallel-PC hereafter) for multi-core 
personal computers. 

In this paper, we present a method to parallelise the 
order-independence PC algorithm in (^ for high dimen¬ 
sional datasets. We propose to parallelise the Cl tests at each 
level of the algorithm. The Cl tests at each level are grouped 
and distributed over different cores of the computer, and 
the results are integrated at the end of each level. Conse¬ 
quently, the runtime of our parallel-PC algorithm is much 
shorter than the original PC algorithm and the results of the 
parallel-PC algirhtm is independent of the variable ordering 
in a dataset. Importantly, it does not require any extra 
knowledge or installation effort from users. Our experiment 
results on both synthetic and real world datasets show that 
it is now practical to explore causal relationships with the 
PC algorithm in gene expression datasets using a multi-core 
personal computer. The proposed algorithm contributes to 
bridging the gap between computer science theory and its 
practicality in important scientific research areas. As an 
application, we modified the causal inference method in 
(10 , by replacing the PC algorithm with the parallel-PC 
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algorithm, to infer miRNA-mRNA regulatory relationships 
in three different cancer datasets. The experimental results 
show that our method not only outperforms the original 
causal inference method that uses the original PC algorithm, 
but also faster. 

We summarise the contributions of the paper in the 
following: 

1) Presenting a fast PC algorithm which is suitable 
for high dimensional datasets, and the results are 
independent of the order of the variables. To the best 
of our knowledge, this is the first work of utilising 
parallel computing for the PC algorithm. 

2) Providing the software tool and proving that it is 
reliable and efficient using a wide range of real 
world gene expression datasets. 

3) Modifying a causal inference method to infer 
miRNA-mRNA regulatory relationships with better 
performance. 

The rest of the paper is organised as follows. Section 
2 discusses related work, and Section 3 presents the PC 
algorithm and the stable-PC algorithm as well as the demon¬ 
strating examples. The proposed algorithm is presented 
in Section 4. Experiment results are shown in Section 5, 
Section 6 presents an application of parallel-PC in inferring 
miRNA-mRNA regulatory relationships, and finally Section 
7 concludes the paper. 

2 Related work 

The well-recognised methods for causality discovery are 
based on probabilistic graphical modeling g, (24). The 
structure of these graphical models is a DAG (directed 
acyclic graph), with its nodes representing random vari¬ 
ables and edges representing dependencies between the 
variables |j^. There are two main approaches to learning 
the causal structure from data: 1) search and score and 2) 
constraint based methods. The search and score methods 
1^ , search for all possible DAGs whilst using a 

scoring function to measure the fit of each DAG to the data. 
The DAG that best fits the data will be chosen. TTowever, 
learning Bayesian networks using this approach is an NP- 
hard problem (28), and the proposed methods following this 
approach were only able to cater for datasets with a limited 
number of variables, and therefore may not be suitable for 
high dimensional datasets. Meanwhile, the constraint based 
approach (^, ||^|, (^ uses Cl tests to remove non-causal 
relationships between variables. These methods are suitable 
for sparse datasets in practice [ |30| . 

Due to the significance of the PC algorithm in the con¬ 
straint based causal discovery approach, there have been 
several methods aiming to modify the procedure of the 
PC algorithm directly to improve the efficiency and/or the 
accuracy of the algorithm. Steck and Tresp (31) used the 
necessary path condition to reduce the number of Cl tests in 
the PC algorithm. They have proved that when testing the 
edge between X to Y, we do not need to condition on the 
nodes that are not in a path from X to Y. This modification 


may help reduce the chance for making errors by conduct¬ 
ing fewer tests. However, it requires more running time for 
finding the nodes that are not in the path. Other researchers 
(2^ , (32) proposed to replace the Cl test, which is normally 
based on the Chi-square statistical test, with the Bayesian 
statistical tests. They aim to reduce the error rates made by 
the Cl tests. However, there is still no clear evidence in real 
world datasets about the impacts of this replacement on the 
efficiency improvement of the PC algorithm. 

Meanwhile, the other constraint based methods |T( infer 
causal relationships by performing Cl tests and searching 
for specific causal structures. For instance. Cooper 
proposed the LCD method to learn only a small portion 
of edges rather than the complete network. Specifically, 
the LCD algorithm searches for the CCC structures which 
consist of three pairs of correlated variables: {A, B), {A, 
C) and {B, C). If A and C become independent when 
conditioned on B, we may infer that one of the following 
causal relations exists between A, B, and C\ A B ^ C; 
A ^ B ^ C; and A ^ B ^ C. Silverstein et al. |l6l | 
proposed the similar method that focuses on the CCU struc¬ 
tures. The CCU method searches for two correlated pairs 
{A,B) and {A,C) and an independent pair [B,C). When 
conditioning on A, if {B, C) becomes dependent then the 
causal relationships B ^ A C are concluded. Compared 
to the PC algorithm, such specific structure finding methods 
are more efficient, but the causal discoveries are incomplete. 

Due to the inefficiency of algorithms concerning causal 
relationships between all pairs of variables in a dataset, sev¬ 
eral methods have been proposed to discover only the local 
causal relationships from data, i.e. the relationships between 
a specific target and its neighbours. A popular approach is to 
identify the Markov blanket of a target variable. In a causal 
graph, Markov blanket has two equivalent definitions: 1) 
the Markov blanket of a node T is a set MB{T) such 
that T is independent of other nodes when conditioning on 
MB{T), or 2) the Markov blanket of a node T comprises its 
parents, children, and children's parents (spouses). Based 
on these two definitions, two main approaches have been 
proposed to discover the Markov blanket, the grow-and- 
shrink approach and the divide-and-conquer approach. The 
grow-and-shrink approach is based on the first definition 
and it uses the heuristic function in the "grow" step to 
find the superset of the true Markov blanket, then the false 
positives in this super set will be removed in the "shrink" 
step. Some well-known methods in this approach include 
GS (Grow and Shrink) (^, IAMB (Incremental Association 
Markov Blanket) (Tt) , and IAMB variants (M) . The divide 
and conquer approach uses the topology constraints in 
the second definition of Markov blanket. This approach 
firstly finds the parents and children of the target node, 
then the spouses can be found by searching through the 
parents and children of the identified nodes in the first step. 
This approach attracts several methods namely HITON- 
PC/MB (35), MMPC-MB |@, semi-lnterleaved HITON-PC 
(3^, PCMB (3g, and IPC-MB (^ (see (S^ for a review). 

In another direction, Jin et al. (3^ proposed an efficient 
method for testing the persistent associations between the 
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target variable and its neighbours. The method is based on 
partial association tests and association rule mining frame¬ 
work to remove the spurious associations. In a similar fash¬ 
ion, Li et al. integrated the ideas of retrospective 

cohort study with the association rule mining framework. 
They proposed to divide the samples into two groups of 
individuals, who share common characteristics but differ in 
regard to a certain factor of interest. The newly-designed 
dataset is called "fair dataset" and is used to infer the level 
of influence that the factor of interest affects the target vari¬ 
able. These works were extended in ||40), and the software 
tools were also provided. However, these methods are only 
applicable for binary datasets with a fixed target variable, 
and thus may not be suitable for high dimensional datasets, 
e.g. gene expression data. 

The common characteristic of most of the above- 
mentioned works is that they aim to propose efficiency- 
improved alternative methods to the PC algorithm. How¬ 
ever, the results from these heuristic methods may involve 
a high rate of false discoveries. 

Scutari (IT] utilised parallel computing technique for 
constraint based local causal discovery methods. Mean¬ 
while, Chen et al. (42) proposed a method to parallelise the 
Bayesian network structure learning algorithm using search 
and score approach. However, there is still no existing work 
that parallelises the PC algorithm. 

3 The original-PC and stable-PC algo¬ 
rithms 

3.1 Notation, definitions, and assumptions 

Let G = (V, E) be a graph consisting of a set of vertices V 
and a set of edges E C V x V. The set of vertices that are 
adjacent to H in graph G is defined as: adj{A, G) = {B : 
{A, B) G E or{B, A) G E}. B is called a collider if we have 
the v-structure: A ^ B G. 

Graph G is a Directed Acyclic Graph (DAG) if G contains 
only directed edges and has no directed cycles. The skeleton 
of a DAG G is the undirected graph obtained from G by 
ignoring the direction of the edges. An equivalence class of 
DAGs is the set of DAGs which have the same skeleton 
and the same u-structures. An equivalence class of DAGs 
can be uniquely described by a completed partially directed 
acyclic graph (CPDAG) which includes both directed and 
undirected edges. 

Definition 1 (d-separation) (43) . In a DAG, a path p between 
vertices A and B is active (d-connecting, where "d" stands for 
dependence) relative to a set of vertices C {A, B C) if (i) every 
non-collider on p is not a member of C; (ii) every collider on p is 
an ancestor of some member of C. Two sets of variables A and B 
are said to be d-separated by C if there is no active path between 
any member of A and any member of B relative to C. 

Let P be the joint probability distribution of V. The 
following assumptions are set when applying the PC algo¬ 
rithm. 
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Assumption 1. (Causal Markov Condition 1^) G and P satisfy 
the Causal Markov Condition if and only if given the set of all 
its parents, a node of G is probabilistically independent of all its 
non-descendents in G. 

Assumption 2. (Faithfulness Condition Q) G and P satisfy the 
Faithfulness Condition if and only if no conditional independence 
holds unless entailed by the Causal Markov Condition. 

Assumption 3. (Causal sufficiency For every pair of vari¬ 
ables which have their observed values in a given dataset, all their 
common causes also have observations in the dataset. 

3.2 The original-PC algorithm 

The PC algorithm (2) (original-PC algorithm henceforward) 
has two main steps. In the first step, it learns from data a 
skeleton graph, which contains only undirected edges. In 
the second step, it orients the undirected edges to form 
an equivalence class of DAGs. As the first step of the PC 
algorithm contributes to most of the computational costs, we 
only focus on the modification of this skeleton learning step 
in this paper, and information about the edge orientation 
step can be found in (2). 

The theoretical foundation of the original-PC algorithm 
(^ is that if there is no link (edge) between nodes X and Y, 
dien there is a set of vertices Z that either are neighbours of 
X or Y such that X and Y are independent conditioning on 
Z. In other words, Z disconnects X and Y. This foundation 
is presented as Theorem 1 below. 

Theorem 1 (^.1/ vertices X and Y are not adjacent in a 
DAG G, then there is a set of vertices Z which is either a subset 
of adj{X,G)\Y or a subset of adj(Y,G)\X such that Z d- 
separates X and Y in G. 

Based on Theorem 1, a naive approach for learning the 
skeleton is that we can search exhaustively for all possible 
neighbours of X and Y and perform conditional indepen¬ 
dence tests to justify if there is a set Z disconnecting X 
and Y. Obviously, this is an inefficient approach, as the 
neighbours of X and Y are unknown, and thus the search 
space is all variables other than X and Y. In the following 
we explain how the original-PC algorithm works and list 
the details of the skeleton learning step of the original-PC 
algorithm in Algorithm 1. 

In the skeleton learning step (Algorithm 1), the orginal- 
PC algorithm starts with the fully connected network and 
uses the conditional independence tests to decide if an edge 
is removed or retained. For each edge, the PC algorithm tests 
if the pair of variables connected by the edge, X and Y, are 
independent conditioning on a subset Z of all neighbours of 
X and Y. The Cl tests are organised by levels (based on the 
size of the conditioning sets, e.g. the depth d). At the first 
level (d = 0), all pairs of vertices are tested conditioning on 
the empty set. Some of the edges would be deleted and the 
algorithm only tests the remaining edges in the next level 
(d = 1). The size of the conditioning set, d, is progressively 
increased (by one) at each new level until d is greater than 
the size of the adjacent sets of the testing vertices. The key 
feature that makes the PC algorithm efficient in sparse true 
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Algorithm 1: Step 1 of the original-PC algorithm: learn¬ 
ing the skeleton 

Input: Dataset D with a set of variables V, and 
significant level a 

Output: The undirected graph G with a set of edges E 
Assume all nodes are connected innitially 
Let depth d = 0 

repeat 

for each ordered pair of adjacent vertices X and Y in 

G do 

if (|adj(X, G')\{F}| >= d) then 

for each subset Z C adj{X, G')\{F} and 
\Z\ = d do 

Test I{X,Y\Z) 
if I{X,Y\Z) then 

Remove edge between X and Y 
Save Z as the separating set of 
iX,Y) 

Update G and E 

break 

end 

end 

end 

end 

Let d = d + 1 

until jorf/XX, G)\{E}| < d for every pair of adjacent 
vertices in G; 


underlying graphs is that the neighbours of a particular 
node are dynamically updated when an edge is removed. 
Therefore, the number of conditional independence tests is 
small when the true graph is sparse Q. If the conditional 
independence tests are truly correct, then the PC algorithm 
will return the true graph as stated in Theorem 2 |2). 

Theorem 2 |2|. Let the distribution P be faithful to a DAG 
G = (V; E), and assume that we are given perfect conditional 
independence information about all pairs of variables {X,Y) in 
V given subsets S. Then the output of the PC-algorithm is the 
CPDAG that represents G. 

Figure 1 (top) shows an example of original-PC algo¬ 
rithm being applied to a dataset with four nodes. A, B,G, 
and D. The original-PC algorithm starts with the fully 
connected graph. At the first level, all edges are tested 
conditioning on the empty set. In the example, after level 
1 tests, there are three edges left. At the next level, each 
remaining edge will be tested conditioning on each neigh¬ 
bour (size 1) of the testing variables. For example, with the 
edge between A and B we have at most two tests which 
are conditioning on G and conditioning on D. If the test 
returns independence (e.g. /(A, i?|G)), we remove the edge 
from the graph and move to testing the other edge. The 
procedure will stop when there is no test to perform. 

3.3 The stable-PC algorithm 

The Cl tests in the original-PC algorithm are prone to 
mistakes as the number of samples is limited. Moreover, 
incorrectly removing or retaining an edge would result in 
the changes in the neighbour sets of other nodes, as the 
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Fig. 1: Examples of applying PC (top) and stable-PC (bot¬ 
tom). 


graph is updated dynamically. Therefore, the output graph 
is dependent on the order in which we perform the condi¬ 
tional independence tests. In other words, given a lexical 
order of performing the tests, the order of the variables 
in the input data will affect the output graph. In Figure 1 
(top), at the beginning of level 2, after we perform test #10, 
the edge between A and B is removed and the graph is 
updated. This also means that the adjacent set of A is also 
updated, i.e. adjjA, G) = {G, D}. Therefore, when we test 
the edge between A and G, we do not need to condition 
on B. The updates make the original-PC algorithm efficient, 
but they also cause problems when the Cl tests are making 
mistakes. If test #10 wrongly removes the edge between A 
and B, then we miss the test I{A,G\B) which may remove 
the edge between A and G. Moreover, if we test I{A, G\B) 
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first and remove the edge between A and C, we will end up 
with a different graph. In other words, when we change the 
order of the variables in the dataset, e.g. A, C, B, and D, the 
result of the original-PC algorithm may change. 

Colombo et al. (20 | proposed a modification to the 
original-PC algorithm to obtain a stable output skeleton 
which does not depend on how variables are ordered in the 
input dataset. In this method (called stable-PC algorithm), 
the neighbour (adjacent) sets of all nodes are queried and 
kept unchanged at each particular level. As a result, an 
edge deletion at one level does not affect the conditioning 
sets of the other nodes, and thus the output is independent 
with the variable ordering. For instance, at the beginning of 
level 2 in the above example, the adjacent set of node A is 
adj{A, G) = {B, C, D}. This adjacent set is kept unchanged 
at this level. After performing test #10, we remove the edge 
between A and B in the graph, but the adj{A,G) is not 
updated. Therefore, we still consider B as an adjacent node 
of A, and tests #11 and #12, still need to be performed 
as shown in Figure 1 (bottom). Test #11 removes the edge 
between A and C, and therefore generates different output 
graph compared to the original-PC algorithm. 

However, this modification requires performing more 
conditional independence tests in each level of the algo¬ 
rithm, and thus it increases further the runtime of the algo¬ 
rithm. In the following section, we adopt this modification 
to design a modified PC algorithm which is both efficient 
and order-independent. 

4 Parallel-PC algorithm 

4.1 Parallel Computing: Potentials and Challenges 

Computer hardware technology has been advanced signifi¬ 
cantly in the last decade. Nowadays, it is common to have 
personal desktop computers and even laptops have 4 cores 
in their CPUs. Likewise, the number of cores in today's 
servers has jumped to 8, 16, 32 or more. These resources 
provide a great potential of utilising parallel computing 
for important algorithms, including the PC algorithm, for 
personal computers. In fact, the majority of end users are 
using their personal computers for doing research. There¬ 
fore, parallel algorithms would have significant impact if 
they work well for personal computers and do not require 
users' technical knowledge about parallelism and high- 
performance clusters. 

The idea of parallelism is that we break down a big task 
into several different smaller subtasks and distribute them 
over different cores of the computer's CPU to run in parallel. 
The results from all subtasks will then be integrated to form 
the result of the original task. Theoretically, this divide- 
and-conquer approach can speed up an algorithm by the 
maximum number-of-core times. This approach, however, 
requires the subtasks to be unrelated. In other words, the re¬ 
sult from each subtask must be independent of one another 
and no communication is allowed between them. 

The independence requirement has been a great barrier 
for many sequential algorithms, including the stable-PC 


algorithm, to be parallelised. In the stable-PC algorithm, 
the performance bottleneck is the huge number of Cl tests, 
which can reach hundreds of millions in real-world datasets. 
The Cl tests, however, are order-dependent across different 
levels. Since the stable-PC algorithm updates the adjacent 
sets of all nodes at each level after all the Cl tests of the 
level are completed, the test results of a particular level will 
influence the results of the next level. Therefore, it seems 
infeasible to parallelise the stable-PC algorithm as a whole. 

4.2 The parallel-PC Algorithm 

Our proposed strategy is to parallelise the Cl tests, not 
across different levels, but inside each level of the stable 
PC-algorithm. This approach is feasible because the Cl tests 
at a particular level are independent of each other. Since the 
graph is only updated at the end of each level, the result 
of one Cl test does not affect the others. Therefore, the Cl 
tests at a level can be executed in parallel without changing 
the final result. Furthermore, this approach also enjoys the 
advantage of knowing the number of Cl tests of each level 
in advance. This allows the Cl tests to be evenly distributed 
over different cores, so that the parallelised algorithm can 
achieve maximum possible speedup. 

Within a level (depth) of Algorithm 2, we employ the 
parallel computing paradigm to parallelise their executions 
(the parallelisation step). At each level with the condition¬ 
ing sets of size d, the sequential process in the stable-PC 
algorithm is replaced by a three-staged process: (1) the Cl 
tests are distributed evenly among the cores, (2) each core 
performs its own sets of Cl tests in parallel with the others, 
and (3) the test results in all cores are integrated into the 
global graph. This three-staged process is applied at all the 
levels (the depth d) of the algorithm. 

We observed that it is not efficient to distribute the Cl 
tests of the same edge to different cores. Given an edge 
between X and Y, we need to perform the Cl tests con¬ 
ditioning on the variables in adj{X,G) and in adj{Y,G). 
However, if we firstly perform the Cl tests conditioning 
on variables in adj{X, G) and the edge between X and Y 
is removed, then the Cl tests conditioning on variables in 
adj{Y, G) are unnecessary. We resolve this dependency by 
grouping the Cl tests of the same edge together, instead of 
separating them as in the stable-PC algorithm. In Algorithm 
2, if the Cl tests between X and Y return independence 
when conditioning on Z C adj{X,G)\{Y}, the algorithm 
will not perform the Cl tests conditioning on variables 
in adj{Y,G). Grouping those tests together, the algorithm 
ensures the Cl tests of the same edge will not be distributed 
to different cores of the CPU, and thus reduce the number 
of unnecessary Cl tests that need to be performed. 

In Algorithm 2, we provide an optional input parameter, 
memory-efficient indicator s. If s is set to TRUE, the algo¬ 
rithm will detect the free memory of the running computer 
to estimate the number of edges in G that will be distributed 
evenly to the cores. This step is to ensure that each core of 
the computer will not hold off a big amount of memory 
while waiting for the synchronisation step. The memory- 
efficient procedure may consume a little bit more time. 
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Algorithm 2: The parallel-PC algorithm 
Input: Dataset D, significant level a, P cores, 

memory-efficient indicator s, number of edges 
per batch tf, 

Output: The undirected graph G with a set of edges E 
Assume all nodes are connected in graph G 
Let depth d = 0 

repeat 

Query and fix the adjacent set adj{X, G) of each 
node A in G 

Compute the set J of unordered pairs of adjacent 

vertices (A, Y) in G 

// Parallelisation Step 

for each batch oftt edges (ti, = \J\ if s = FALSE) 

do 

Distribute the edges in the batch evenly into P 
cores, each with Jp edges 
for each core p = 1.. . P in parallel do 
for each pair (A, A) € Jp do 

Let Y indicate if (A, Y) is adjacent, 
initialize y — TRUE 
// On A' s neighbours 
if |aJj(A, G)\{A}| > Jthen 

for each subset Zx Q adj{X, G)\{A} 
and \Zx \ = d do 

if I{X,Y\Zx) then 
= FALSE 

break 

end 

end 

end 

// On Y' s neighbours 
if \adj{Y, G)\{A}| > d then 

for each subset Zy Q adj{Y, G)\{A} 
and \Zy\ = d do 

if/(A, AI Zy) then 

kP^Y = false 

break 

end 

end 

end 

end 

end 

// Synchronisation Step 
for each core p = 1... P do 
for each pair (A, A) S Jp do 
if y = FALSE then 

Remove the edge between A and A 
and update G and E 
end 
end 
end 
end 

Let d = d + 1 

until |aJj(A, G)\{A}| < d for every pair of adjacent 
vertices in G; 


However, this option is recommended for computers with 
limited memory resources or for big datasets. 


4.3 Implementation 

We implement our parallel-PC algorithm using R (44) . Our 
implementation is based on the stable-PC algorithm in the 
R package pcalg 0 and the native R library parallel. The 
implementation has been tailored to work in Linux, MacOS, 
and Windows operating systems. The codes of parallel-PC 
with and without the memory efficient option together with 
the instructions of running the algorithm are available at: 
http://nugget.unisa.edu.au/ParallelPC 

5 Experimental evaluation and applica¬ 
tions 

5.1 Datasets 


TABLE 1: Real world gene expression datasets 


Dataset 

#samples 

#variables 

stable-PC runtime 

NCI-60 

47 

1190 

13.56 mins 

MCC 

88 

1380 

59.87 mins 

BR51 

50 

1592 

68.37 mins 

S. cerevisiae 

63 

5361 

207.41 mins 

S. aureus 

160 

2810 

477.20 mins 

DREAMS-Insilico 

805 

1643 

> 24 hours 


In this section, we evaluate the efficiency of the proposed 
algorithm by applying it to five real world datasets and 
a synthetic dataset (see Table 1 for a summary of the 
datasets). The first three datasets are commonly used in 
research of inferring the regulatory relationships between 
microRNAs (a type of gene regulators) and genes. These 
datasets were downloaded from (45). Meanwhile, the last 
three datasets are used for discovering the regulatory re¬ 
lationships between transcription factors (another type of 
gene regulators) and the target genes regulated by them. 
The S. cerevisiae dataset was provided by the authors of |T^, 
and the last two datasets were downloaded from |46). the 
number of samples and variables in each dataset together 
with the runtime of the stable-PC algorithm are summarised 
in Table 1. We choose these datasets for the experiments, 
as they require a wide range of runtime when applying 
the stable-PC algorithm. All of the datasets are available at: 
nugget. unisa .edu.au/ParallelPC 

5.2 Efficiency evaiuations 

We apply the stable-PC algorithm, and the parallel-PC algo¬ 
rithm to all of the datasets and compare their performance in 
terms of the runtime. For each dataset, we run five versions 
of the parallel-PC algorithm using 4, 5, 6, 7, and 8 cores 
CPU respectively. Meanwhile, for the stable-PC algorithm, 
we used the implementation in the pcalg Q R package. We 
choose the range from four to eight cores for the experi¬ 
ments, as a modem personal computer normally has four 
cores and a small server usually has eight cores. We run all 
the experiments on a Linux server with 2.7 GB memory and 
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Fig. 2: Runtime of the algorithms on the six datasets. We run each experiment three times, and each value in the figure 
is the average runtime of the three runs. The average variation coefficient (standard deviation / mean) of the PC-parallel 
algorithm in the three runs is 0.039 and the maximum variation coefficient is 0.055. 


2.6 Ghz per core CPU. All experiments were run three times 
and the average value of the three runs was reported. 

As shown in Figure 2, the runtime of the parallel-PC 
algorithm is much shorter compared to the stable-PC algo¬ 
rithm. Please note that the parallel-PC algorithm generates 
the results that are independent of the order of the variables 
in the datasets, and the output results are consistent with 
the stable-PC algorithm. 

In the relatively small datasets, NCl-60, stable-PC took 
around 14 minutes to complete, but parallel-PC with 4-core 
CPU only spent around 5 minutes. For this small dataset, 
increasing the number of cores does not reduce the runtime 
significantly due to the cost of synchronisation between the 
cores. 

For the BR51 and MCC datasets, it took more than one 
hour each for running the stable-PC algorithm. The 6-core 
parallel-PC algorithm only used less than one third (less 
than 20 minutes) of its counterpart's runtime. Meanwhile, 
there is a clear trend in the S. cerevisiae dataset that the more 
powerful CPU (up to 15 cores) we use, the shorter runtime 
the parallel-PC algorithm has. 

The efficiency of the parallel-PC algorithm is even higher 
with big datasets. In the S.aureus dataset, the runtime of the 
stable-PC algorithm was around 8 hours. Meanwhile, using 
a 4-core version of the parallel-PC algorithm, it took only 
around two hours. This is a significant result for end users 


who only use personal computers for doing their research 
experiments. The runtime is further reduced when using 
computer CPU with more cores. For instance, it took just 
more than one hour (76.1 minutes) for running this dataset 
with an 8-core CPU computer (server). With the DREAM5- 
Insilico dataset, after more than 24 hours the stable-PC still 
could not produce any results, and we had to abort the 
job. This is a typical situation when applying the stable-PC 
algorithm to big datasets, as we may not be able to predict 
its running time on a real world dataset. However, our 4- 
core parallel-PC algorithm managed to finish within around 
12 hours, and the 8-core version took less than 6 hours (341.2 
minutes). 


The experiment results suggest that our proposed al¬ 
gorithm has significantly shorter runtime than the stable- 
PC algorithm. Given that our method generates the same 
output and requires no extra effort from users compared 
to that of the stable-PC algorithm, the parallel-PC is more 
useful in exploring causal relations in high dimensional 
data, especially gene expression data. Note that the pro¬ 
vided software tool is not restricted to only personal PCs, 
and it is ready for clusters or super computers with CPUs of 
more cores. To provide more insight into the experiments in 
Figure 2, we run the experiments on a super computer (with 
a 48-core CPU) using up to 16 cores of the CPU and the 
results are in Figure 3. For most of the datasets, the runtime 
was not improved further when we used more than 14 cores 
of the CPU. 
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Fig. 3: Runtime of the algorithms using up to 16 cores with the six datasets. 


5.3 Scalability 

To compare the scalability of the parallel-PC algorithm 
and the stable-PC algorithm, we used the DREAMS-insilico 
dataset as a base and randomly sample from it four datasets 
with different numbers of samples (200, 400, 600, and 805) 
and five datasets with different numbers of variables (500, 
750, 1000, 1250, and 1500). 

Figure 4 (top) shows that the runtime of the stable-PC 
algorithm increases significantly when the number of vari¬ 
ables is increased in a dataset. The reason is that the number 
of Cl tests is increased when the number of variables is 
increased. The parallel-PC algorithm suffers from the same 
problem, but the increasing trend of the runtime is less 
significant compared to that of the stable-PC algorithm. 

Similarly, the runtime of both PC and parallel-PC goes 
up when we increase the number of samples in the dataset 
as shown in Figure 4 (bottom). However, the reason for 
runtime increase is not because of the increased number of 
Cl tests, but due to the increase in the time taken to perform 
a test. When we increase the number of samples in the input 
dataset, the time for performing one Cl test is increased, and 
thus resulting in the large increase in total runtime given 
that the number of Cl tests is huge. 

5.4 Impact of the memory-efficient option 

As mentioned in Section 4, the memory-efficient version 
of the algorithm uses less amount of memory at any time 
point while running the algorithm. We observed that peak 
memory usage in the algorithm with memory-efficient op¬ 
tion (s=TRUE) is about half of that of the algorithm without 


using the option across the six datasets in Table 1. However, 
there is a trade-off between memory efficiency and time effi¬ 
ciency as show in Table 2. The experiments were conducted 
on a MacBook Pro with 8GB RAM, 2.6 GHz. Running S. 
cervisiae with s — FALSE failed to finish due to memory 
exhaustion. 


TABLE 2: Impact of the memory-efficient option 


Dataset 

Time (mins) 

Memory (MBs) 

s = F 

s = T 

s = F 

s = T 


NCI-60 

2.2 

2.9 

919 

421.5 

MCC 

9.4 

11.6 

1323 

533.5 

BR51 

11.9 

15.3 

1593 

679.4 

Insilico 

531.7 

557.9 

1533 

784 

S. aureus 

92.1 

91.3 

4038 

1857.1 

S. cerevisiae 

NA 

91.3 

>8000 

6004 


6 Application in inferring miRNA-mRNA 

REGULATORY RELATIONSHIPS 

6.1 Computational methods for identifying mlRNA tar¬ 
gets 

microRNAs (miRNAs) are important gene regulators at 
post-transcriptional level. They regulate gene expression 
by promoting messenger RNA (mRNA) degradation and 
repressing translation [ |^ . miRNAs control a wide range 
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Fig. 4: Scalibility of the parallel-PC algorithm with number 
of attributes (top) and with data size (bottom). 

of biological processes and are involved in several types of 
cancers 1^ . In the last decade, predictions of miRNA 
functions through identifying miRNA-mRNA regulatory 
relationships by computational methods have increasingly 
achieved promising results. Computational approaches are 
proving to be effective in generating hypotheses to assist 
with the design of wet-lab experiments for confirming 
miRNA targets. 

Several methods have been proposed to identify 
miRNA-mRNA regulatory relationships. In the first stream, 
researchers identified miRNA-mRNA interactions using se¬ 
quence data (49), (g, 1^ . Although these methods can 
predict potential miRNA targets, the results may contain 
a high-rate of false positives and false negatives (5^ . In the 
second stream of research, various computational methods 
have been devised to use expression profiles in the study 
of mlRNA-mRNA regulatory relationships. The principle of 
these methods is to investigate if a change in the mlRNA 
expression level would result in a change in the mRNA 
expression. Some highlights of the methods are correlation 
analysis (5^, (^, regression mode ls (SS) , (^, population- 
based probabilistic learning model |57), rule based methods 
(58) , Bayesian network learning |59), |60|, (bT) , causal infer¬ 
ence techniques (^, 1^^, (^, and ensemble methods (6^ . 


Please refer to for a survey of different miRNA target 
prediction methods. 

6.2 Inferring miRNA-mRNA reguiatory reiationships 
with efficient-IDA 

In this section, we propose to use parallel-PC to modify 
the the causal inference method ||^| for inferring miRNA- 
mRNA regulatory relationships. The method is based on 
a causal inference approach called IDA (Intervention effect 
when the DAG is absent) |[^|. IDA firstly learns the causal 
structure from data using the PC algorithm, it then uses do- 
calculus to estimate the causal effect that a variable has on 
the other. The estimated causal effects simulate the effects of 
randomised controlled experiments. Although the method 
is proved to be effective in uncovering miRNA-mRNA reg¬ 
ulatory relationships, it has high computational complexity 
and suffers from the order-dependent problem. In this sec¬ 
tion, we propose to tackle these problems by modifying the 
IDA-based method by using parallel-PC rather than PC as 
the first step of the IDA algorithm. The modified algorithm 
is called efficient-IDA and includes the following steps. 

• Step 1: Find miRNAs and mRNAs that are differ¬ 
entially expressed between different conditions, e.g. 
normal vs cancer. The expression profiles of those 
miRNAs and mRNAs will be combined into a dataset 
where rows are samples and columns are variables 
(miRNAs and mRNAs). In our experiments, we 
download from (45) three processed cancer datasets 
with differentially expressed profiles. 

• Step 2: Apply parallel-PC to learn the causal struc¬ 
ture from data. parallel-PC will help tackle the effi¬ 
ciency and order-dependency problems of the origi¬ 
nal PC algorithm. 

• Step 3: Use do-calculus to estimate the causal effects 
that the miRNAs have on the mRNAs. For more 
information on this step, please refer to (13) . 

The running time of IDA is quite similar to the PC 
algorithm, as the time for the inference step (Step 3 as 
described above) is small. Therefore we do not compare the 
runtime of efficient-IDA and IDA again and please refer to 
Figures 2 and 3 for details. In the following sub-sections, we 
compare the performance of efficient-IDA and IDA in cor¬ 
rectly identifying miRNA-mRNA regualtory relationships. 

6.3 efficient-IDA discovers more experimentally miRNA 
targets than IDA 

We applied efficient-IDA and IDA to NCI-60, MCC, and 
BR51 and compare their performance on discovering the 
experimentally confirmed miRNA targets. 

To validate the target predictions from the two methods, 
we used the combination of four different experimentally 
confirmed miRNA target databases as the ground truth. 
They are Tarbase v6.0 (^ , miRecords v2013 |j^, mlRWalk 
v2.0 | |66) , and miRTarBase v4.5 |67). Tarbase, miRecords, and 
miRTarBase include verified interactions that are manually 
curated from the literature, and miRWalk contains both 
the predicted and the experimentally validated interactions. 
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Fig. 5: Number of confirmed miRNA targets identified by 
IDA and efficient-IDA. The top figure: the top 100 targets of 
each of the miRNAs in a dataset are extracted for validation. 
The bottom figure: the top Ik, 5k, 10k, 15k, and 20k interac¬ 
tions in each dataset (from left to right, NCT60 (EMT), BR51, 
and MCC) are extracted respectively for validation. 


Respectively for Tarbase, miRecords, miRWalk, and miRTar- 
Base, we have 20095 interactions with 228 miRNAs, 21590 
interactions with 195 miRNAs, 1710 interactions with 226 
miRNAs, and 37372 interactions with 576 miRNAs. After 
removing the duplicates, we have in total 62858 unique 
interactions to be used in the validation. 


We used two different ways to extract prediction re¬ 
sults for the comparison. Firstly, we extracted the top 100 
predicted targets of each of the miRNAs in a dataset and 
validate them against the experimentally confirmed miRNA 
targets. Secondly, we extracted the top 1000, 2000, 5000, and 
10000 predicted miRNA-mRNA interactions of each method 
for validation. 


Figures 5 shows the comparison results of the number 
of confirmed mlRNA targets of efficient-IDA and IDA us¬ 
ing the two validation methods respectively. We can see 
from the figure that efficient-IDA discovers more confirmed 
miRNA targets than IDA in both of the cases in all datasets. 
The experiment results suggest that efficient-IDA is not 
just faster in runtime, but the order-independent property 
of efficient-IDA helps improve the accuracy of the causal 
inference method (IDA). 


7 Conclusions AND FUTURE WORK 

It is very important to explore causal relationships in real 
world high dimensional datasets. For instance, the discov¬ 
ered causal relationships can help elucidate the causes of 
fatal diseases from gene expression data. However, existing 
causal discovery method do not provide practical solutions 
to the problem. In this paper, we have developed an effi¬ 
cient algorithm for exploring causal relationships in high 
dimensional datasets based on the stable-PC algorithm and 
parallel computing framework. We modify the stable-PC 
algorithm and group the Cl tests in the algorithm to enable 
the tests to be performed in parallel. Our method pro¬ 
duces the causal relationships that are consistent with the 
stable-PC algorithm, but much more efficiently regarding 
the runtime. The experiment results from a wide range of 
commonly used real world datasets suggest the efficiency of 
the proposed method. parallel-PC also shows its usefulness 
by improving the efficiency and accuracy of IDA in inferring 
miRNA-mRNA regulatory relationships. 

The effectiveness of the parallel-PC algorithm could be 
further improved with a job weighting and scheduling 
scheme. In the current version of parallel-PC, we divide 
the number of Cl tests equally and distribute to different 
cores of the CPU. However, there is no guarantee that the 
execution time of different cores is the same. A smart job 
weighting and scheduling scheme would help speed up the 
algorithm. 

Finally, a straightforward but useful future work is to 
apply the parallel-PC algorithm to other PC related causal 
discovery and causal inference methods. Most of the con¬ 
straint based methods are based on the idea of the PC 
algorithm directly or indirectly. The FCI Q algorithm and 
RFCI j68| are based on the PC algorithm to learn causal 
graphs which allows latent variables. The CCD algorithm 
(6^ for learning Markov equivalent class utilises the PC 
algorithm as the first step. The parallelised versions of these 
algorithms would help bridging the gap between computer 
science theory and their applications in practice. 
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